Overview
We often collect data from two or more groups. Group allocations can be stored as categorical variables, and we often want to explore and compare differences between groups, using these variables.
Principles/ideas
- Types of data (and impact on how R presents it)
- Different purposes of plots/graphics
- Purpose and design of data graphics
Techniques covered
- Making boxplots
- Adding color to a plot
- Using different types of data and visual scales
- Facets?XXX
- Using group_by() to make a summary table and compare groups
- Aggregating repeated measures data?? XXX
Boxplots
Video summary
- Boxplots are useful for comparing the distribution of a numeric variable across the categories of a categorical variable.
- The function geom_boxplot() represents data on the y-axis using boxes, whiskers and points for each category on the x-axis.
Code examples
mpg %>%
  ggplot(aes(class, cty)) + # choose data: x = class, y = cty
  geom_boxplot() # draw the boxplot
Boxplots are useful for comparing a numeric variable across categories. In a boxplot, the thick line is the median. That thick line is enclosed inside a rectangle (the ‘box’), and the height of the box indicates the inter-quartile range (IQR). The IQR contains the middle 50% of the ordered data. A larger IQR indicates greater variation in a dataset.
The top and bottom of the box are called ‘hinges’. The vertical lines connected to each hinge are called ‘whiskers’, and give some indication of the broader range of the data. Exactly what the whiskers show differs depending on the particular options you use to draw a boxplot. In this case, the upper whisker shows the largest data point that is no more than 1.5 times the IQR above the upper hinge. The lower whisker is the lowest point no more than 1.5x the IQR below the lower hinge. In this version of a boxplot, any data point outside the range of the whiskers is described as an ‘outlying point’ and is shown individually as a dot.
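The whisker rule described above can be checked by hand. The following is a minimal sketch in base R, assuming the mpg column of the built-in mtcars data as example input; all variable names are illustrative only. It uses quantile() for the hinges, which is an assumption about how the box is drawn and may differ slightly from other boxplot functions.

```r
# Example input: miles per gallon from the built-in mtcars data
x <- mtcars$mpg

# Hinges (top and bottom of the box) and the inter-quartile range
lower_hinge <- unname(quantile(x, 0.25))
upper_hinge <- unname(quantile(x, 0.75))
iqr <- upper_hinge - lower_hinge

# Whiskers: the most extreme data points within 1.5 * IQR of each hinge
upper_whisker <- max(x[x <= upper_hinge + 1.5 * iqr])
lower_whisker <- min(x[x >= lower_hinge - 1.5 * iqr])

# Outlying points: anything beyond the whiskers, drawn as individual dots
outlying <- x[x > upper_whisker | x < lower_whisker]

c(lower_whisker = lower_whisker, upper_whisker = upper_whisker)
outlying
```

Note that the whiskers always end on real data points, so they are usually shorter than the full 1.5 x IQR limit.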
Exercise 9
- Create a new chunk at the bottom of your worksheet.
- Create a boxplot with drv (front-wheel/rear-wheel/4-wheel drivetrain) on the x-axis and hwy on the y-axis.
- Run the chunk.
Your boxplot should look like this:
Using colour
Video summary
- The points in a scatterplot can be coloured based on a categorical variable.
- This is done by giving aes() a colour option.
Code examples
# colour points using a different colour for each drv (drivetrain) category
mpg %>%
  ggplot(aes(cty, hwy, color = drv)) +
  geom_point()
Colour can be added to a scatterplot to categorise the points. So far, you’ve used the aes() function to define which variables are plotted on the x and y axes. The aes() function is short for ‘aesthetics’, and it is a way to map variables to visual aspects of a plot. In addition to mapping variables to the x and y axes, the colour option (which can also be spelled color) maps a categorical variable to the colour of the points.
We could enhance the scatterplot showing mpg for city and highway driving, by adding a colour for each drivetrain.
mpg %>%
  ggplot(aes(cty, hwy, color = drv)) +
  geom_point()
Exercise 10
- Create a new chunk at the bottom of your worksheet.
- Create a scatterplot using the gapminder dataset with gdpPercap on the x-axis, lifeExp on the y-axis, and continent in colour.
- Run the chunk.
Your plot should look like this:
Types of variables and visual scale
The video introduces the link between types of variable (e.g. continuous, categorical, text), the data-types which R uses to store them, and the way that ggplot presents them on the scales of a plot.
- Common types of variable are: numeric, categorical and text (string)
- Internally, R stores data in a number of different data types.
- These data types mostly match up with the different types of variable, but not always, so watch out!
- For example, sometimes numeric data can get stored as text by accident (we would need to convert this)
- Categorical variables can be stored as either factors or as text/strings (again, we can convert between them as needed)
- ggplot (and other R functions) use data-types as a clue to choose defaults for the scales of your graphs
- Normally the defaults are good; sometimes it’s helpful to manually adjust the scale by switching the data type
The following R code is used in the video:
# load the tidyverse package
library(tidyverse)
# ... etc etc
Data types
Data comes in all shapes and sizes, but an important distinction researchers make is between types of variable.
You might have seen terms like these:
- interval or continuous variables (also called real numbers)
- ordinal variables (e.g. Likert style 1-7 responses, sometimes called factors in experimental designs)
- count variables (whole numbers greater than or equal to zero)
- nominal or categorical variables, which are also sometimes
These types of variable mostly match up with the data-types R uses to store them, and the names are often the same, but this is not always the case. Sometimes we need to convert between data types.
In R there are three main data-types you need to know about:
- Numeric data, which is stored as a ‘double’, abbreviated dbl. ‘Double’ means ‘double precision number’, which is computer speak for ‘any kind of real number, even a very large one’.
- Categorical data, which is stored as a factor (abbreviated fct).
- Text, which is stored with the ‘character’ data-type (abbreviated chr).
You might also encounter these data types, but we don’t specifically need them for this course:
- Boolean data (true/false values)
- Dates (a special kind of numeric data, which R formats nicely as a date for us)
- Ordinal variables (a special type of factor where the categories are ordered)
If you have a variable in R you can use the typeof function to check which data-type it is. For example:
typeof(1)
[1] "double"
typeof("apple")
[1] "character"
You can also see which data-type is used to store a variable when using the glimpse command you saw in session 1.
iris %>% glimpse
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4…
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3…
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1…
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0…
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, …
In the glimpse output you can see the variable names listed on the left, followed by grey text surrounded by angle brackets, e.g. <dbl>, which is the abbreviated data-type.
In this built-in dataset, most of the data is numeric (dbl), but the Species variable is categorical, and stored as a factor (fct).
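One caveat when checking types by hand: typeof() reports how R stores a value internally, so for a factor it returns "integer" rather than "factor". The short sketch below uses the same iris data, with class(), is.factor() and levels() added as complementary checks (these three functions are not used elsewhere in this worksheet).

```r
# A numeric column: stored as a double
typeof(iris$Sepal.Length)   # "double"

# A factor column: typeof() shows the internal integer codes...
typeof(iris$Species)        # "integer"

# ...whereas class() and is.factor() show that it is a factor
class(iris$Species)         # "factor"
is.factor(iris$Species)     # TRUE

# The category labels themselves are available with levels()
levels(iris$Species)
```

This is why glimpse, which reports fct directly, is often the more convenient check for whole datasets.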
Data types and scales on graphs
If we look at the mtcars data we can see that all the variables are stored as numeric data (dbl):
mtcars %>% glimpse()
Rows: 32
Columns: 11
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8…
$ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 1…
$ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 18…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92…
$ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 1…
$ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0…
$ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2…
This is fine if we want to make a scatter plot (here of ‘miles per gallon’ vs weight of the car):
mtcars %>%
  ggplot(aes(wt, mpg)) +
  geom_point()
In this plot both the x and y axes are continuous. That is, they are numeric variables, using real numbers.
However, if we want to make a boxplot of the mpg variable using am as the x axis then we have a problem:
mtcars %>%
ggplot(aes(am, mpg)) +
geom_boxplot()
Warning: Continuous x aesthetic -- did you forget aes(group=...)?
We might have expected to see:
- miles per gallon on the y axis
- two separate boxes, one for automatic cars and another for manual.
This doesn’t work as expected though.
Because ggplot has seen that am is stored as numeric data, it creates a continuous scale on the x axis, and draws a single box at the midpoint of all the values of am. Because am ranges from 0 to 1, this box appears at 0.5.
As glimpse showed us, the variable am is stored as numeric data, with type dbl (short for ‘double’). However, we really want to use am as a categorical variable, so we should store it as a factor. If we convert it to a factor then our plot will work properly.
We can use the command factor(am) to tell R that the x-axis is a factor:
mtcars %>%
  ggplot(aes(factor(am), mpg)) +
  geom_boxplot()
This gives us the boxplot we were expecting. The only change here is to replace am with factor(am). This tells R to convert the variable am to a factor. ggplot can then draw the x axis correctly.
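If the 0 and 1 labels on the x axis are hard to read, factor() can also attach names to the categories. This is a sketch rather than part of the original example: it assumes the coding given in the mtcars help page (0 = automatic, 1 = manual), and the names cars and am_f are made up for illustration.

```r
library(ggplot2)

# Copy the data and add a labelled factor version of am
# (assumed coding from ?mtcars: 0 = automatic, 1 = manual)
cars <- mtcars
cars$am_f <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# One box per transmission type, labelled by name instead of 0/1
p <- ggplot(cars, aes(am_f, mpg)) +
  geom_boxplot()
p
```

Storing the factor as a new column, rather than converting inside aes(), keeps the original numeric am available in case it is needed later.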
You need to learn these datatypes and abbreviations:
| Data type | Abbreviation | Used for |
|---|---|---|
| double | dbl | Numeric data (e.g. interval or continuous variables) |
| character | chr | Text data, and sometimes also categorical variables |
- Hide this page and test a friend or someone next to you in the room on what each of them means.
- Repeat this in 20 minutes time to check you still have it (spaced repetition is effective).
Exercise XXX
Use mtcars to make a boxplot showing miles per gallon on the y axis, and number of gears the car has on the x axis (gear).
Your plot should look like this:
XXX ADD EXTENSION EXERCISES TO DO THE SAME WITH COLOR SCALES>>> E.G.
mtcars %>%
  ggplot(aes(wt, mpg, color=gear)) +
  geom_point()
mtcars %>%
  ggplot(aes(wt, mpg, color=factor(gear))) +
  geom_point()
Grouping with group_by
The video explains and gives examples showing that:
- Datasets often contain categorical variables
- We often want to compare statistics (like averages) between categories
- The group_by function is a quick way to combine filtering and summarising
- group_by creates a grouped dataframe
- Using grouped dataframes with other functions (e.g. summarise) applies them once-per-group
- The result is always a new dataframe
The following R code is used in the video:
# mtcars has the `gear`, `cyl` and `am` variables, which could be treated as
# either categorical or numeric
mtcars %>% select(gear, cyl, am) %>% head
gear cyl am
Mazda RX4 4 6 1
Mazda RX4 Wag 4 6 1
Datsun 710 4 4 1
Hornet 4 Drive 3 6 0
Hornet Sportabout 3 8 0
Valiant 3 6 0
# we previously made a box-plot broken down by a category
mtcars %>%
ggplot(aes(factor(gear), mpg)) +
geom_boxplot()
# we can use filter to calculate averages for each category
mtcars %>%
filter(gear == 4 ) %>%
summarise(mean(mpg))
mean(mpg)
1 24.53333
mtcars %>%
filter(gear == 5 ) %>%
summarise(mean(mpg))
mean(mpg)
1 21.38
# ... and so on. However, this gets repetitive with many groups.
# Instead we can use group_by to make a table with a row for each group
mtcars %>%
group_by(gear) %>%
summarise(mean(mpg))
# A tibble: 3 x 2
gear `mean(mpg)`
<dbl> <dbl>
1 3 16.1
2 4 24.5
3 5 21.4
# We can add standard deviations (or other stats) to the same table and give each column a name
mtcars %>%
group_by(gear) %>%
summarise(Mean = mean(mpg), SD = sd(mpg))
# A tibble: 3 x 3
gear Mean SD
<dbl> <dbl> <dbl>
1 3 16.1 3.37
2 4 24.5 5.28
3 5 21.4 6.66
XXX THIS WAS COPIED FROM SESSION 2 … MIGHT NEED SOME RE-JIGGING
Grouping data with group_by()
Video summary:
- Our data may have categorical or ‘grouping’ variables (e.g. gender, or country).
- We often want to create summaries for each group.
- We could use filter() and summarise() once for each group, but the group_by() function does this for all groups.
- Adding group_by() to a pipeline runs the subsequent steps once for each group.
- Be careful only to group by categorical variables.
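The last point can be illustrated with a small sketch. It uses mtcars rather than the dataset from the video, and assumes the dplyr package (part of the tidyverse) is installed; the names by_gear, by_mpg and mean_wt are made up for illustration. Grouping by gear gives one row per category, while grouping by a continuous variable such as mpg gives a separate group for every distinct value, which is rarely a useful summary.

```r
library(dplyr)

# Grouping by a categorical variable: one row per number of gears
by_gear <- mtcars %>%
  group_by(gear) %>%
  summarise(n = n(), mean_wt = mean(wt))

# Grouping by a continuous variable: one row per distinct mpg value
by_mpg <- mtcars %>%
  group_by(mpg) %>%
  summarise(n = n(), mean_wt = mean(wt))

# Compare how many rows each summary table has
nrow(by_gear)
nrow(by_mpg)
```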
# boxplot of CO2 uptake grouped by grass type
CO2 %>%
  ggplot(aes(Type, uptake)) +
  geom_boxplot()
# table of CO2 uptake grouped by grass type
CO2 %>%
  group_by(Type) %>%
  summarise(average_uptake = mean(uptake))
# A tibble: 2 x 2
Type average_uptake
<fct> <dbl>
1 Quebec 33.5
2 Mississippi 20.9
# group by two factors at once: grass type and experimental treatment
CO2 %>%
group_by(Type, Treatment) %>%
summarise(mean(uptake))
# A tibble: 4 x 3
# Groups: Type [2]
Type Treatment `mean(uptake)`
<fct> <fct> <dbl>
1 Quebec nonchilled 35.3
2 Quebec chilled 31.8
3 Mississippi nonchilled 26.0
4 Mississippi chilled 15.8
In this video we’ll use a dataset about plants rather than cars. Plants photosynthesise by combining sunlight with carbon dioxide to make sugars. The CO2 dataset records carbon dioxide uptake for grass plants of two types (from Quebec and from Mississippi). The type is stored as a factor.
We might make a plot like this:
CO2 %>%
  ggplot(aes(Type, uptake)) +
  geom_boxplot()
But what if we want these numbers in a table (for example, to include in a report)? We can do that using group_by and summarise…
CO2 %>%
group_by(Type) %>%
summarise(average_uptake = mean(uptake))
# A tibble: 2 x 2
Type average_uptake
<fct> <dbl>
1 Quebec 33.5
2 Mississippi 20.9
Another factor in this dataset is an experimental treatment: whether the grasses were chilled or nonchilled. We can also group by two factors at once and get a row for each combination:
CO2 %>%
group_by(Type, Treatment) %>%
summarise(mean(uptake))
# A tibble: 4 x 3
# Groups: Type [2]
Type Treatment `mean(uptake)`
<fct> <fct> <dbl>
1 Quebec nonchilled 35.3
2 Quebec chilled 31.8
3 Mississippi nonchilled 26.0
4 Mississippi chilled 15.8
Exercise 16
chickwts contains data for the weights of chicks (in grams) fed on different diets.
glimpse(chickwts)
Rows: 71
Columns: 2
$ weight <dbl> 179, 160, 136, 227, 217, 168, 108, 124, 143, 140, 309, 229, 18…
$ feed <fct> horsebean, horsebean, horsebean, horsebean, horsebean, horsebe…
Calculate the mean and standard deviation of chick weights for each type of feed.
The mean weight of chicks fed on linseed was ____ g (to 2 decimal places).
The standard deviation of the weights of chicks fed on sunflower was ____ g (to 2 decimal places).
Use the built-in iris dataset
Use group_by to calculate the average Sepal.Length of each Species of flower.
Extension exercises
Extension exercise XXX
Make a scatterplot of the diamonds data. Show carat on the x-axis, price on the y-axis and the clarity of the diamond in colour. Try to produce your plot before comparing it against the answer using the button below.
Extension exercise XXX
Make a scatterplot of the mpg data. Show city mpg on the x-axis, highway mpg on the y-axis and the vehicle class in colour. Try to produce your plot before comparing it against the answer using the button below.
Extension exercise 1
Make a boxplot showing life expectancy by continent for years greater than 1999. (Hint: use filter(), ggplot() and geom_boxplot().)
The plot should look like this:
Extension exercise XXX
This boxplot uses the gapminder dataset to show lifeExp (life expectancy) on the y-axis for each continent on the x-axis.
In a new chunk, write the R code to produce this plot.
Extension exercise XXX
Create a boxplot which shows drivetrain on the x-axis and miles per gallon when a car is driven in the city on the y-axis. Your plot should look like this:
Check your knowledge
Write an answer to each of these questions in the Check your knowledge section of your workbook. The answers are revealed in Session 4.
- Which function makes a boxplot?
- What is the difference between a dbl and a fct or ord?
- Give an example of when the difference between dbl and fct matters when making a plot? (include code examples for this if you can)
- How can you convert a variable from a dbl to a fct?
- How could you calculate the mean for one level of a factor?
- How would you calculate the mean for all levels of a factor?